Google Analytics Categorization

Categorizing ADC site page URLs to make user engagement easier to analyze.


Google Analytics Definitions

Page: The part of the URL after the domain name (the path) for content that someone viewed on the website. For example, if someone views https://www.example.com/contact, then /contact is reported as the page in the Behavior reports.
User: An individual visitor to the site (tracked using browser cookies)
Sessions: A session is a single visit to the website, consisting of one or more pageviews and any other interactions. The default session timeout is 30 minutes.
User % of Total: Users displayed as a percentage of the Total Users during the report period
Pageviews: The number of times users view a page that has the Google Analytics tracking code inserted. This covers all pageviews, so if a user refreshes the page, or navigates away from the page and returns, each of these is counted as an additional pageview.
Unique Pageviews: The number of sessions in which the page was viewed at least once. Whether a user viewed the page once or five times during a visit, it counts as just one unique pageview.
Entrances: The number of sessions that started on a specific web page or group of web pages, i.e. the first page that someone views during a session.
Bounce Rate: The percentage of sessions in which the user left the site after just one pageview, regardless of how long they stayed on that page (total bounces divided by total sessions).
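
To make the definitions above concrete, here is a minimal sketch that derives the metrics from a tiny, made-up hit log. The `hits` data frame and its `session` and `page` columns are assumptions for illustration only; Google Analytics computes these values internally and its exact rules may differ in edge cases.

```r
# Hypothetical hit log: one row per pageview, tagged with its session.
hits <- data.frame(
  session = c("s1", "s1", "s1", "s2", "s3", "s3"),
  page    = c("/", "/catalog", "/", "/about", "/", "/team"),
  stringsAsFactors = FALSE
)

# Pageviews: every view of a page counts, including repeats.
pageviews <- table(hits$page)

# Unique pageviews: a page counts at most once per session.
unique_pageviews <- table(unique(hits[, c("session", "page")])$page)

# Entrances: the first page viewed in each session.
first_hit <- !duplicated(hits$session)
entrances <- table(hits$page[first_hit])

# Bounce rate: share of sessions that consist of a single pageview.
views_per_session <- table(hits$session)
bounce_rate <- mean(views_per_session == 1)
```

In this toy log the home page "/" has three pageviews but only two unique pageviews, because session s1 viewed it twice.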



Categorization Function

We will use the code and function below to categorize the Google Analytics dataset. The function takes messy character data within a data frame and categorizes it based on a set of search strings. The inputs are the data frame, the name of the column holding the messy data, a list of search strings, and a list of category names (the two lists must be the same length and in matching order). You can optionally name the new column.

The order of the search strings matters when one string contains another. Each row is assigned to the first search string that matches it and is then excluded from later matching, so “catalog” would also claim pages containing “catalog/submit” if it came first. List the longer, more specific string first (i.e. catalog/submit before catalog). Additionally, make sure the order of the categories list corresponds to the order of the search strings.
Source: https://github.com/lenwood
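
The effect of search string order can be seen in a small sketch. The helper `first_match()` below is hypothetical and is not the categorization function itself; it only mimics the rule that the first matching search string decides the category.

```r
pages <- c("catalogsubmit", "catalog", "catalogprofile")

# Hypothetical helper: label each page with the category of the FIRST
# search string that matches it.
first_match <- function(pages, searchList, catList) {
  out <- rep("OTHER", length(pages))
  # Work backwards so that earlier search strings overwrite later ones.
  for (i in rev(seq_along(searchList))) {
    hit <- grepl(searchList[i], pages, ignore.case = TRUE)
    out[hit] <- catList[i]
  }
  out
}

# Short string first: "catalog" claims every page.
bad <- first_match(pages,
                   c("catalog", "catalogsubmit", "catalogprofile"),
                   c("Cathome", "Submit", "Summary"))

# Longer strings first: each page gets its specific category.
good <- first_match(pages,
                    c("catalogprofile", "catalogsubmit", "catalog"),
                    c("Summary", "Submit", "Cathome"))
```

With the short string first, all three pages end up in "Cathome"; with the longer strings first, they are labelled "Submit", "Cathome" and "Summary" respectively.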



Identify Search Strings and Category Names

# List of search strings -- note that the longer search strings are identified first 
search <- c("news", "portals", "about","catalogprofile", "catalogsubmit", "catalog", "training", "team", "home", "view", "submit", "profile")

# List of categories
categories <- c("News", "Portals", "About", "Summary", "Submit", "Cathome", "Training", "Team", "Home", "Dataset", "WhoMustSub", "Summary")


Create Categorization Function

# Quickly categorize a data frame with a column of messy character strings. 

# Replace "df" with your messy dataframe.

categorizeDF <- function(df, searchColName, searchList, catList, newColName="Category") {
  # create empty data frame to hold categories
  catDF <- data.frame(matrix(ncol=ncol(df), nrow=0))
  colnames(catDF) <- paste0(names(df))

  # add sequence so original order can be restored
  df$sequence <- seq(nrow(df))

  # iterate through the strings
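  for (i in seq_along(searchList)) {
    rownames(df) <- NULL
    # [[ ]] returns a plain vector for both data frames and tibbles
    index <- grep(searchList[i], df[[searchColName]], ignore.case=TRUE)
    # skip strings with no matches: df[-integer(0),] would drop every row
    if (length(index) > 0) {
      tempDF <- df[index,]
      tempDF$newCol <- catList[i]
      catDF <- rbind(catDF, tempDF)
      df <- df[-index,]
    }
  }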

  # OTHER category for unmatched rows
  if (nrow(df) > 0) {
    df$newCol <- "OTHER"
    catDF <- rbind(catDF, df)
  }

  # return to the original order & remove the sequence data
  catDF <- catDF[order(catDF$sequence),]
  catDF$sequence <- NULL

  # remove row names
  rownames(catDF) <- NULL

  # set Category type to factor
  catDF$newCol <- as.factor(catDF$newCol)

  # rename the new column
  colnames(catDF)[which(colnames(catDF) == "newCol")] <- newColName
  catDF
}
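
One R subtlety matters for any loop that removes matched rows with a negative index: when a search string matches nothing, grep() returns an empty index, and subsetting with a negative empty index drops every row instead of none. The sketch below uses a made-up three-row data frame to show the pitfall and a simple guard against it.

```r
df <- data.frame(Page = c("home", "about", "team"), stringsAsFactors = FALSE)

# No row contains "news", so grep() returns integer(0).
index <- grep("news", df$Page, ignore.case = TRUE)

# Pitfall: a negative empty index selects nothing, so every row is dropped.
dropped <- df[-index, , drop = FALSE]
nrow(dropped)

# Guarded version: only remove rows when there is something to remove.
kept <- if (length(index) > 0) df[-index, , drop = FALSE] else df
nrow(kept)
```

Here `dropped` has zero rows while `kept` still has all three, which is why a search list containing a string with no matches needs the guard.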


Call Function and Categorize Data

# Replace "df" with messy dataframe

# Identify which column you want to categorize -- in our case with Google Analytics, we will be categorizing the "Page" column that contains messy URL strings. Additionally, you can name the new column that contains the categories (e.g. "Category").

sorted <- categorizeDF(df, "column name with messy data", search, categories, "new category column name")



Test Run of Categorization with Small Subset of Data

###### TEST DATASET ######


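# Remove slashes and other punctuation (including hyphens and periods) from the data. **** Not sure if this is necessary. The aim is to distinguish the single "/" (the ADC homepage) and to make search terms easier to identify for the function below.
# NOTE: mutate_all() applies gsub() to every column, so periods are also stripped from numeric columns and all columns become character.
library(dplyr)  # provides %>% and mutate_all()
test_users_clean <- top_30_users %>%
  mutate_all(~ gsub("[[:punct:]]", "", .))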


# Rename the home page to "home". **NOTE: in this dataset the home page is the most-viewed page, so it sits in row 1, hence the [1]. If it is not the most-viewed page, find the row that holds the home page and put that row number in the brackets instead. *** Is there a better way to do this? ***

test_users_clean$Page[1] <- "home"


### Categorize the page URLS in the Page column into larger categories using a function ###

## Create a list of search strings to sort through pages and a list of categories (the two lists must be in matching order). **Order matters when one search string contains another: each page goes to the first string that matches it, so list the longer string first (i.e. "catalogsubmit" before "catalog").

# List of search strings
search <- c("news", "portals", "about","catalogprofile", "catalogsubmit", "catalog", "training", "team", "home", "view", "submit", "profile")

# List of categories
categories <- c("News", "Portals", "About", "Summary", "Submit", "Cathome", "Training", "Team", "Home", "Dataset", "WhoMustSub", "Summary")



## Create function [below] to categorize the messy "Page" column of the raw data frame. 
# This function looks at a data frame column of messy character (or factor) data and produces a new column of categorized data. The inputs are the data frame, the column name of the messy data, a list of search strings, a list of category names (these two must be in matching order), and you have the option of naming the new column.


# Function:
categorizeDF <- function(test_users_clean, searchColName, searchList, catList, newColName="Category") {
  # create empty data frame to hold categories
  catDF <- data.frame(matrix(ncol=ncol(test_users_clean), nrow=0))
  colnames(catDF) <- paste0(names(test_users_clean))

  # add sequence so original order can be restored
  test_users_clean$sequence <- seq(nrow(test_users_clean))

  # iterate through the strings
  for (i in seq_along(searchList)) {
    rownames(test_users_clean) <- NULL
    # [[ ]] returns a plain vector for both data frames and tibbles
    index <- grep(searchList[i], test_users_clean[[searchColName]], ignore.case=TRUE)
    # skip strings with no matches: a negative empty index would drop every row
    if (length(index) > 0) {
      tempDF <- test_users_clean[index,]
      tempDF$newCol <- catList[i]
      catDF <- rbind(catDF, tempDF)
      test_users_clean <- test_users_clean[-index,]
    }
  }

  # OTHER category for unmatched rows
  if (nrow(test_users_clean) > 0) {
    test_users_clean$newCol <- "OTHER"
    catDF <- rbind(catDF, test_users_clean)
  }

  # return to the original order & remove the sequence data
  catDF <- catDF[order(catDF$sequence),]
  catDF$sequence <- NULL

  # remove row names
  rownames(catDF) <- NULL

  # set Category type to factor
  catDF$newCol <- as.factor(catDF$newCol)

  # rename the new column
  colnames(catDF)[which(colnames(catDF) == "newCol")] <- newColName
  catDF
}


# Call the function and create new data frame - using the raw data frame, the messy column you want to sort, the search and category lists, and name of the new column

sortedDF <- categorizeDF(test_users_clean, "Page", search, categories, "Category")


knitr::kable(sortedDF, format = "html")
Page Users Sessions Users_._of_Total Pageviews Unique_Pageviews Entrances Bounce_Rate Category
home 25436 42464 0440145 75951 55359 42443 0421957423 Home
catalog 4310 2280 0257363 12070 8734 2131 0247368421 Cathome
catalog 4130 416 0195397 33 26 19 0033653846 Cathome
data 3291 2306 0160785 19380 9507 2319 0273200347 OTHER
catalogdata 3114 3395 0139405 14923 7964 3298 0253608247 Cathome
about 2637 941 0123776 3942 3297 944 0582359192 About
team 1634 614 0110133 2554 2117 615 0684039088 Team
submit 1384 898 009936 3255 2395 901 0643652561 WhoMustSub
page0 1174 1580 0090577 5362 3813 1582 0158860759 OTHER
training 1166 892 0083537 2245 1639 892 0515695067 Training
publications 1120 431 0077705 1701 1326 432 0744779582 OTHER
share 1060 528 0072758 4532 2706 529 0357954545 OTHER
profile 989 232 0068477 1609 1338 238 061637931 Summary
qanda 912 214 0064713 1430 1181 214 0570093458 OTHER
january2019datasciencetrainingforarcticresearchers 903 1004 0061441 1359 1191 1004 0815737052 Training
datapage0 873 556 0058545 4704 2254 557 0303956835 OTHER
catalogprofile 799 193 0055914 1376 1121 188 0564766839 Summary
proposals 773 660 0053551 1187 1008 661 0762121212 OTHER
homehtm 735 800 0051402 936 817 800 056875 Home
support 729 121 0049463 1308 1004 122 058677686 OTHER
dataplans 685 384 0047672 932 827 384 0841145833 OTHER
2018datasciencetrainingforarcticresearchers 649 639 0046015 982 876 639 0723004695 Training
news201606datascienceopportunities 629 371 0044488 810 733 371 0851752022 News
upcomingdatasciencetrainingforarcticresearchers 599 612 0043066 857 767 612 0823529412 Training
catalogsubmit 582 302 0041746 1657 1196 304 0523178808 Submit
catalogportalspermafrost 548 651 0040505 874 722 650 0769585253 Portals
reconcilinghistoricalandcontemporarytrendsinterrestrialcarbonexchangeofthenorthernpermafrostzone 546 844 0039355 1189 995 844 0808056872 OTHER
viewdoi103334CDIAC00001V2017 522 562 0038272 769 619 562 0807829181 Dataset
catalogshare 521 399 0037263 1197 994 378 0483709273 Cathome
categorynews 512 197 0036317 822 672 197 0624365482 News
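
The rate columns in the output above appear without decimal points, which suggests that stripping punctuation from every column also removed the periods in the numeric values. A minimal sketch of one alternative is shown below: clean only the Page column, then label the home page by its value rather than by its row position. The `raw` data frame and its values are made up for illustration, and this assumes the home page is recorded as a bare "/".

```r
# Hypothetical stand-in for the raw export; values are invented.
raw <- data.frame(
  Page        = c("/", "/catalog/", "/about/", "/home.htm"),
  Bounce_Rate = c(0.42, 0.25, 0.58, 0.57),
  stringsAsFactors = FALSE
)

clean <- raw
# Strip punctuation from the Page column only; numeric columns are untouched.
clean$Page <- gsub("[[:punct:]]", "", clean$Page)

# The bare "/" becomes an empty string, which identifies the home page
# wherever it sits in the data frame.
clean$Page[clean$Page == ""] <- "home"
```

Because the home page is matched by value, this works even when it is not the most-viewed page, and Bounce_Rate stays numeric for later calculations.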





Visualizations for User Analysis


Total Users Over Time

Remember that Users are all individual visitors to the site tracked by browser cookies. If a User visits the site multiple times with the same browser, they will not be counted twice.


##  2016  2017  2018  2019  2020 
## 23772 32205 48337 31187 60330
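
Yearly totals like the ones above could be produced with a grouped sum. The sketch below is only an assumption about how such a summary might be computed; the `annual` data frame and its values are made up, and the `Year` and `Users` column names are assumed.

```r
# Hypothetical stand-in for the categorized data with a Year column.
annual <- data.frame(
  Year  = c(2016, 2016, 2017, 2017, 2017),
  Users = c(10, 5, 7, 3, 2)
)

# Sum Users within each year; the result is keyed by year.
users_by_year <- tapply(annual$Users, annual$Year, sum)
users_by_year
```

Note that adding up per-page Users can count the same visitor once for every page they viewed, so totals built this way are best read as an upper bound on distinct visitors.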


Tree Map for Total Users by Category and Year


Tree Map for Total Users 2016-2020



Circular Bar Plot for Top 100 Users per Category in 2016




Circular Proportion Graph for Total Users by Category

WORK IN PROGRESS

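# Create circular graph that shows the proportion of users within each category (34 total categories)

library(circlize)  # provides the circos.* functions

circos.clear() 


category = annual_sortedDF$Category
# PLACEHOLDER: random values stand in for the real percentages while this is a work in progress
percent = sort(sample(40:80, 34))
color = rev(rainbow(length(percent)))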


circos.par("start.degree" = 90, cell.padding = c(0, 0, 0, 0),
           canvas.xlim=c(-1.2, 1.2),   # bigger canvas?
           canvas.ylim=c(-1.2, 1.2)) 
circos.initialize("a", xlim = c(0, 100)) # "a" just means there is one sector
circos.track(ylim = c(1, length(percent)+1), track.height = 0.9, 
    bg.border = NA, panel.fun = function(x, y) {
        xlim = CELL_META$xlim
        circos.segments(rep(xlim[1], 34), 1:34,
                        rep(xlim[2], 34), 1:34,
                        col = "#CCCCCC")
        circos.rect(rep(0, 34), 1:34 - 0.45, percent, 1:34 + 0.45,
            col = color, border = "white")
        circos.text(rep(xlim[1], 34), 1:34, 
            paste(category, " - ", percent, "%"), 
            facing = "downward", adj = c(1.05, 0.5), cex = 0.8) 
        breaks = seq(0, 85, by = 5)
        circos.axis(h = "top", major.at = breaks, labels = paste0(breaks, "%"), 
            labels.cex = 0.6)
})
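
The work-in-progress code above draws its `percent` values with sample(), so the bars do not yet reflect the data. A minimal sketch of how real shares per category might be computed is shown below. The `toy` data frame is a made-up stand-in for annual_sortedDF, and the `Category` and `Users` column names are assumed.

```r
# Hypothetical stand-in for annual_sortedDF with Category and Users columns.
toy <- data.frame(
  Category = c("Home", "Home", "News", "Training"),
  Users    = c(50, 30, 15, 5),
  stringsAsFactors = FALSE
)

# Total users per category, then each category's share of the grand total.
totals <- aggregate(Users ~ Category, data = toy, FUN = sum)
totals$Percent <- round(100 * totals$Users / sum(totals$Users), 1)

# Sort ascending so the smallest share is drawn on the innermost ring.
totals <- totals[order(totals$Percent), ]

# Labels and values come from the same rows, so they stay aligned.
category <- as.character(totals$Category)
percent  <- totals$Percent
```

Taking `category` and `percent` from the same sorted rows avoids pairing sorted values with unsorted labels, and using length(percent) instead of the hard-coded 34 would let the plot adapt to the number of categories.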



Top 10 Datasets per Year


Dataset Users Pageviews Year
60 175 2016
36 65 2016
35 56 2016
32 133 2016
32 131 2016
32 91 2016
29 97 2016
28 66 2016
27 91 2016
26 50 2016
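
A table like the one above could be built by ranking datasets within each year. The sketch below assumes the ranking is by Users, which matches the descending order shown. The helper `top_n_per_year()` is hypothetical, and the `datasets` data frame with its dataset identifiers is made up for illustration.

```r
# Hypothetical stand-in; Dataset identifiers and values are invented.
datasets <- data.frame(
  Dataset   = c("a", "b", "c", "d", "e"),
  Users     = c(60, 36, 35, 40, 12),
  Pageviews = c(175, 65, 56, 90, 20),
  Year      = c(2016, 2016, 2016, 2017, 2017),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the n most-visited datasets in each year.
top_n_per_year <- function(df, n = 10) {
  # Sort by year, then by descending Users within each year.
  df <- df[order(df$Year, -df$Users), ]
  # Number the rows 1, 2, 3, ... within each year.
  rank_in_year <- ave(seq_len(nrow(df)), df$Year, FUN = seq_along)
  out <- df[rank_in_year <= n, ]
  rownames(out) <- NULL
  out
}

top2 <- top_n_per_year(datasets, n = 2)
```

With n = 2 the toy data yields datasets a and b for 2016 and d and e for 2017; the default n = 10 would give a top 10 per year.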